Motivation: theoretically relevant units \(\neq\) spatial units at which data are available
Example: data for different variables are available at different units
Example: borders, number of units change over time
Example: data are measured at different levels of geographic precision
Example: different definitions of same units across data sources
The dilemma for analysts
Conduct analysis at theoretically inappropriate units
- this is only possible if all data are available for those same units
or
Convert the data to a common set of (more appropriate) units
- this is an intermediate, messy step
- it always entails some information loss
- it can lead to measurement error and biased estimation of quantities of interest
- problem is well-known in geostatistics and social science
- but no best practices exist for implementation, comparison, evaluation
Changes of support
Definitions
What are change of support problems?
- Geographic support: area, shape, size, and orientation associated with a variable’s spatial measurement
- Change of support (CoS) problem: making statistical inferences about a variable at one support by using data from a different support
Related topics:
- ecological inference (EI): deducing micro variation from aggregate data
- modifiable areal unit problem (MAUP): statistical inferences depend on the geographical regions at which data are observed
EI and MAUP are both special cases of CoS problems
The complexity of a CoS depends on
- Relative scale: aggregation, disaggregation, hybrid
- Relative nesting: whether one set of units falls completely and neatly inside the other
Nesting and scale
Illustration
Let’s consider three sets of units (from the U.S. state of Georgia): (a) precincts, (b) constituencies, (c) .5\(^\circ\) grid cells
- Suppose one wants to change the support from precincts to constituencies
- scale: are source units smaller or larger than destination units?
- nesting: do source units fit completely/neatly into destination units?
- Suppose one wants to change the support from constituencies to grid cells
- scale: are source units smaller or larger than destination units?
- nesting: do source units fit completely/neatly into destination units?
- Change of support #1 looks like an aggregation of nested units
- Change of support #2 looks like (mostly?) disaggregation of non-nested units
Some considerations
- many CoS problems require both aggregation and disaggregation
- just because units are politically nested doesn’t mean they are geometrically nested (e.g. measurement error, imprecision of boundaries)
- not always easy to “eyeball” these things
- to get a better read on this, we need quantitative measures
Formally
- \(\mathcal{G}_S\): set of source polygons, indexed \(i=1,\dots,N_S\)
- \(\mathcal{G}_D\): set of destination polygons, indexed \(j=1,\dots,N_D\)
- \(\mathcal{G}_{S\cap D}\): intersection of \(\mathcal{G}_S\) & \(\mathcal{G}_D\), indexed \(i\cap j=1,\dots,N_{S\cap D}\)
- \(a_i\): area of source polygon \(i\);\(\quad\) \(a_j\): area of destination polygon \(j\)
- \(a_{i\cap j}\): area of intersection \(i\cap j\)
define relative scale as \(RS = \frac{1}{N_{S\cap D}}\sum_{i\cap j}^{N_{S\cap D}}1(a_i<a_j)\)
- values of 1 \(=\) aggregation; values of 0 \(=\) disaggregation; 0-1 \(=\) hybrid
define relative nesting as \(RN = \frac{1}{N_S}\sum_{i}^{N_S} \sum_j^{N_D}\left(\frac{a_{i\cap j}}{a_i} \right)^2\)
- values of 1 \(=\) full nesting; values of 0 \(=\) no nesting; 0-1 \(=\) partial nesting
Informally
- relative scale:
share of intersections where the source unit is smaller than the destination unit
- relative nesting:
roughly, the share of source units that are not split across destination units
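The two measures can be sketched in a few lines of Python. This is a minimal illustration, assuming the polygon intersection has already been computed and is available as a list of (source id, destination id, intersection area) triples. The function names, unit ids, and toy areas are hypothetical and not taken from any particular package.

```python
from collections import defaultdict

def relative_scale(intersections, source_area, dest_area):
    """Share of intersections whose source unit is smaller than its destination unit."""
    smaller = sum(1 for i, j, _ in intersections if source_area[i] < dest_area[j])
    return smaller / len(intersections)

def relative_nesting(intersections, source_area):
    """Mean over source units of the summed squared area shares (a_ij / a_i)^2."""
    per_source = defaultdict(float)
    for i, _, a_ij in intersections:
        per_source[i] += (a_ij / source_area[i]) ** 2
    return sum(per_source.values()) / len(source_area)

# Hypothetical toy data: two source units, two destination units.
source_area = {"s1": 4.0, "s2": 4.0}
dest_area = {"d1": 6.0, "d2": 2.0}
# (source id, destination id, intersection area)
intersections = [("s1", "d1", 4.0), ("s2", "d1", 2.0), ("s2", "d2", 2.0)]

rs = relative_scale(intersections, source_area, dest_area)
rn = relative_nesting(intersections, source_area)
print(rs, rn)
```

In this toy case, s1 sits entirely inside d1 while s2 is split evenly across d1 and d2, so RN is (1 + 0.5)/2 = 0.75. Two of the three intersections pair a smaller source unit with a larger destination unit, so RS is 2/3.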
Application of relative scale and nesting to Georgia data: any surprises here?
Relative scale
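| source \ destination | (a) precincts | (b) constituencies | (c) .5\(^\circ\) grid |
|---|---|---|---|
| (a) precincts | – | 1.00 | 1.00 |
| (b) constituencies | 0.00 | – | 0.12 |
| (c) .5\(^\circ\) grid | 0.00 | 0.89 | – |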
Relative nesting
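| source \ destination | (a) precincts | (b) constituencies | (c) .5\(^\circ\) grid |
|---|---|---|---|
| (a) precincts | – | 0.98 | 0.92 |
| (b) constituencies | 0.01 | – | 0.29 |
| (c) .5\(^\circ\) grid | 0.05 | 0.54 | – |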
Change of support algorithms
A CoS algorithm specifies a transformation between source and destination units
- \(x\): variable being transformed from support \(\mathcal{G}_S\) to \(\mathcal{G}_D\)
- \(x_{\mathcal{G}_D}\): true value of variable \(x\) in destination units \(\mathcal{G}_D\)
- \(\widehat{x}_{\mathcal{G}_D}^{(k)}=f_k(x_{\mathcal{G}_S})\): estimated value of \(x_{\mathcal{G}_D}\), calculated with CoS algorithm \(k\)
these range from simple geometric operations to complex model-based predictions
Types of variables
Extensive (depend on area and scale)
- aggregates are (weighted) sums
- must satisfy the pycnophylactic (mass-preserving) property:
- if area is split or combined, its values must be split or combined
- sum of values in destination units must equal sum in source units
- examples: population counts, event counts, acreage, mineral deposits
Intensive (don’t depend on area and scale)
- aggregates are (weighted) means
- examples: population density, vote margins, median income
- intensive variables are often functions of extensive variables (density \(=\) mass/vol.)
- best practice: reconstruct in destination units from transformed components
(\(\widehat{\text{mass}}_{\mathcal{G}_D}/\widehat{\text{volume}}_{\mathcal{G}_D} = \widehat{\text{density}}_{\mathcal{G}_D}\))
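A small numeric sketch of why the intensive variable is rebuilt from its extensive components. It assumes two hypothetical source units that nest fully inside one destination unit; all numbers are made up for illustration.

```python
# Hypothetical source units that nest inside a single destination unit.
units = [
    {"population": 900.0, "area_km2": 10.0},
    {"population": 100.0, "area_km2": 90.0},
]

# Extensive components are summed, so totals are preserved.
pop_dest = sum(u["population"] for u in units)
area_dest = sum(u["area_km2"] for u in units)

# Intensive variable rebuilt from the transformed components.
density_dest = pop_dest / area_dest

# For contrast: an unweighted mean of the source densities.
naive_mean = sum(u["population"] / u["area_km2"] for u in units) / len(units)

print(density_dest, naive_mean)
```

The reconstructed density is 1000/100 = 10 people per km². The unweighted mean of source densities is about 45.6, because it gives the small, dense unit the same weight as the large, sparse one.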
Areal interpolation
Areal weighting is the default CoS method in many commercial and open-source GIS
Advantages
- easy to implement
- requires information only on geometry of source and destination units
- no need for ancillary data
Disadvantages
- assumes that the phenomenon of interest is uniformly distributed in source units
- this becomes less problematic if source units are relatively small
- but more problematic as source units increase in size
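A toy illustration of the uniformity assumption. It assumes one hypothetical source unit split evenly by area between two destination units, with the entire population actually living in the western half. The "true" shares are invented for the example; in practice they are unknown to the analyst.

```python
source_population = 100.0

# Share of the source unit's area falling in each destination unit.
area_share = {"west": 0.5, "east": 0.5}

# Hypothetical true share of the population in each destination unit.
true_share = {"west": 1.0, "east": 0.0}

# Areal weighting allocates population in proportion to area.
estimate = {j: share * source_population for j, share in area_share.items()}
truth = {j: share * source_population for j, share in true_share.items()}
error = {j: estimate[j] - truth[j] for j in estimate}

print(estimate, error)
```

Areal weighting assigns 50 people to each half, which undercounts the west by 50 and overcounts the east by 50. The total is still preserved; only the allocation is wrong.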
Pseudocode for areal interpolation
- Intersect \(\mathcal{G}_{S}\) and \(\mathcal{G}_{D}\), creating a third polygon layer \(\mathcal{G}_{S\cap D}\),
- each feature \(i\cap j\in \{1,\dots,N_{S\cap D}\}\) is a part of source polygon \(i\) that falls inside destination polygon \(j\).
- Compute area weights for each intersection \(i\cap j\), proportional to
- for extensive variables: \(w_{i\cap j}^{\text{(ext)}}=\frac{a_{i\cap j}}{a_i}\)
(i.e. share of \(i\)’s area represented by intersection \(i\cap j\))
- for intensive variables: \(w_{i\cap j}^{\text{(int)}}=\frac{a_{i\cap j}}{a_j}\)
(i.e. share of \(j\)’s area contributed by intersection \(i\cap j\))
- Combine weighted statistics for each destination polygon \(j\):
- \(\hat{x}_j=\sum_{i\cap j}^{N_{\cap j}} w_{i\cap j}x_{i\cap j}\), where \(x_{i\cap j}\) is the value of \(x\) in the source polygon \(i\) to which intersection \(i\cap j\) belongs, and \(N_{\cap j}\) is the number of intersections in \(j\)
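The steps above can be sketched in plain Python. To keep the example self-contained, it assumes every unit is an axis-aligned rectangle `(xmin, ymin, xmax, ymax)`, so intersections reduce to simple arithmetic; real polygon layers would need a GIS library for the overlay step. The helper names and toy data are hypothetical.

```python
def area(r):
    """Area of a rectangle given as (xmin, ymin, xmax, ymax)."""
    return (r[2] - r[0]) * (r[3] - r[1])

def overlap_area(r, s):
    """Area of overlap between two axis-aligned rectangles (0 if disjoint)."""
    width = min(r[2], s[2]) - max(r[0], s[0])
    height = min(r[3], s[3]) - max(r[1], s[1])
    return max(width, 0.0) * max(height, 0.0)

def areal_interpolate(sources, dests, values, extensive=True):
    """Areal weighting of source values onto destination units."""
    estimates = {}
    for j, d in dests.items():
        total = 0.0
        for i, s in sources.items():
            a_ij = overlap_area(s, d)  # step 1: intersect source i with destination j
            if a_ij == 0.0:
                continue
            # step 2: extensive weights use the source area, intensive the destination area
            w = a_ij / area(s) if extensive else a_ij / area(d)
            total += w * values[i]  # step 3: combine weighted source values
        estimates[j] = total
    return estimates

# Hypothetical toy layers: two source units, two non-nested destination units.
sources = {"s1": (0, 0, 2, 2), "s2": (2, 0, 4, 2)}
dests = {"d1": (0, 0, 3, 2), "d2": (3, 0, 4, 2)}
population = {"s1": 100.0, "s2": 40.0}
density = {i: population[i] / area(sources[i]) for i in sources}

pop_hat = areal_interpolate(sources, dests, population, extensive=True)
dens_hat = areal_interpolate(sources, dests, density, extensive=False)
print(pop_hat, dens_hat)
```

Here `pop_hat` is 120 for d1 and 20 for d2, which sums to the source total of 140, so the mass-preserving property holds. The interpolated densities (about 20 and 10) match `pop_hat` divided by each destination's area, as expected under the uniformity assumption.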
Areal interpolation is just one of many potential CoS methods
Examples:
- simple overlay
- population weighted interpolation
- ordinary kriging
- universal kriging
- thin-plate splines and random forests
these differ in their assumptions (e.g. uniformity vs. heterogeneity) and requirements (e.g. ancillary data)
… what’s more important is not the choice of CoS algorithm, but the relative scale and nesting of source and destination units